Abstract

Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on “downstream” POS induction results.


1 Introduction 


Unsupervised POS induction is the problem of assigning word tokens to syntactic categories given only a corpus of untagged text. In this paper we explore the effect of replacing words with their vector space embeddings^1 in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution^2 to generate a word token $w_i \in V$ given a POS tag $t_i \in T$, we use a conditional Gaussian distribution and generate a d-dimensional word embedding $\mathbf{v}_{w_i} \in \mathbb{R}^d$ given $t_i$.

^1 Unlike Yatbaz et al. (2014), we leverage easily obtainable and widely used embeddings of word types.
^2 Also known as a categorical distribution.


Our findings suggest that, in both models, substantial improvements are possible when word embeddings are used rather than opaque word types. However, the independence assumptions made by the model used to induce embeddings strongly determine their effectiveness for POS induction: embedding models that model short-range context are more effective than those that model longer-range contexts. This result is unsurprising, but it illustrates the lack of an evaluation metric that measures the syntactic (rather than semantic) information in word embeddings. Our results also confirm the conclusions of Sirts et al. (2014), who were likewise able to improve POS induction results, albeit using a custom clustering model based on the distance-dependent Chinese restaurant process (Blei and Frazier, 2011).

Our contributions are as follows: (i) reparameterization of token-level POS induction models to use word embeddings; and (ii) a systematic evaluation of word embeddings with respect to the syntactic information they contain.

2 Vector Space Word Embeddings 

Word embeddings represent words in a language’s vocabulary as points in a d-dimensional space such that nearby words (points) are similar in terms of their distributional properties. A variety of techniques for learning embeddings have been proposed, e.g., matrix factorization (Deerwester et al., 1990; Dhillon et al., 2011) and neural language modeling (Mikolov et al., 2011; Collobert and Weston, 2008).

For the POS induction task, we specifically need embeddings that capture syntactic similarities. Therefore we experiment with two types of embeddings that are known for such properties:


• Skip-gram embeddings (Mikolov et al., 2013) are based on a log bilinear model that predicts an unordered set of context words given a target word. Bansal et al. (2014) found that smaller context window sizes tend to result in embeddings with more syntactic information. We confirm this finding in our experiments.

• Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context.

We use the tool word2vec^3 and Ling et al. (2015)'s modified version^4 to generate both plain and structured skip-gram embeddings in nine languages.
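
To make the difference between the two embedding types concrete, the following minimal sketch enumerates the training examples each model would see for a given window size. It is an illustration only, not the word2vec or wang2vec implementation; the function names and the toy sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Standard skip-gram: (target, context) pairs; word order is discarded."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def structured_skipgram_pairs(tokens, window):
    """Structured skip-gram: also keep the relative offset of each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j], j - i))
    return pairs


sentence = ["the", "dog", "barks"]  # toy sentence, hypothetical
print(skipgram_pairs(sentence, window=1))
print(structured_skipgram_pairs(sentence, window=1))
```

With a window of 1, each target word is paired only with its immediate neighbors; the structured variant additionally records whether the neighbor is to the left (-1) or to the right (+1).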


3 Models for POS Induction

In this section, we briefly review two classes of models used for POS induction (HMMs and CRF autoencoders), and explain how to generate word embedding observations in each class. We will represent a sentence of length $\ell$ as $\mathbf{w} = \langle w_1, w_2, \ldots, w_\ell \rangle \in V^\ell$ and a sequence of tags as $\mathbf{t} = \langle t_1, t_2, \ldots, t_\ell \rangle \in T^\ell$. The embedding of word type $w \in V$ will be written as $\mathbf{v}_w \in \mathbb{R}^d$.

3.1 Hidden Markov Models

The hidden Markov model with multinomial emissions is a classic model for POS induction. This model makes the assumption that a latent Markov process with discrete states representing POS categories emits individual words in the vocabulary according to state (i.e., tag) specific emission distributions. An HMM thus defines the following joint distribution over sequences of observations and tags:

$$p(\mathbf{w}, \mathbf{t}) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(w_i \mid t_i) \qquad (1)$$

where $p(t_i \mid t_{i-1})$ represents the transition probability and $p(w_i \mid t_i)$ is the emission probability, the probability of a particular tag generating the word at position $i$.^5

^3 https://code.google.com/p/word2vec/
^4 https://github.com/wlin12/wang2vec/
^5 Terms for the starting and stopping transition probabilities are omitted for brevity.

We consider two variants of the HMM as baselines:

• $p(w_i \mid t_i)$ is parameterized as a “naive multinomial” distribution with one distinct parameter for each word type.

• $p(w_i \mid t_i)$ is parameterized as a multinomial logistic regression model with hand-engineered features as detailed in (Berg-Kirkpatrick et al., 2010).

Gaussian Emissions. We now consider incorporating word embeddings in the HMM. Given a tag $t \in T$, instead of generating the observed word $w \in V$, we generate the (pre-trained) embedding $\mathbf{v}_w \in \mathbb{R}^d$ of that word. The conditional probability density assigned to $\mathbf{v}_w \mid t$ follows a multivariate Gaussian distribution with mean $\mu_t$ and covariance matrix $\Sigma_t$:

$$p(\mathbf{v}_w; \mu_t, \Sigma_t) = \frac{\exp\left(-\frac{1}{2} (\mathbf{v}_w - \mu_t)^\top \Sigma_t^{-1} (\mathbf{v}_w - \mu_t)\right)}{\sqrt{(2\pi)^d \, |\Sigma_t|}} \qquad (2)$$

This parameterization makes the assumption^6 that embeddings of words which are often tagged as $t$ are concentrated around some point $\mu_t \in \mathbb{R}^d$, and the concentration decays according to the covariance matrix $\Sigma_t$.
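
As an illustration of the Gaussian emission density in Eq. 2, the sketch below evaluates its logarithm for the special case of a diagonal covariance matrix, which is the case reported in the experiments. The function name and the toy values are hypothetical, and working in log space is a choice made here for numerical convenience rather than something stated in the paper.

```python
import math


def diag_gaussian_logpdf(v, mean, var):
    """Log of the Gaussian density for a diagonal covariance.

    v, mean: d-dimensional vectors (lists of floats)
    var:     diagonal entries of the covariance matrix
    """
    d = len(v)
    log_det = sum(math.log(s) for s in var)            # log |Sigma_t|
    quad = sum((x - m) ** 2 / s for x, m, s in zip(v, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + quad)


# Toy 2-dimensional example with the variance fixed to 0.45 on every dimension.
embedding = [0.3, -0.1]   # hypothetical embedding of one word
tag_mean = [0.2, 0.0]     # hypothetical mean vector of one tag
print(diag_gaussian_logpdf(embedding, tag_mean, [0.45, 0.45]))
```

Because the covariance is diagonal, the determinant is a product of the variances and the quadratic form is a sum over dimensions, so no matrix inversion is needed.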

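Now, the joint distribution over a sequence of observations $\mathbf{v} = \langle \mathbf{v}_{w_1}, \mathbf{v}_{w_2}, \ldots, \mathbf{v}_{w_\ell} \rangle$ (which corresponds to word sequence $\mathbf{w} = \langle w_1, w_2, \ldots, w_\ell \rangle$) and a tag sequence $\mathbf{t} = \langle t_1, t_2, \ldots, t_\ell \rangle$ becomes:

$$p(\mathbf{v}, \mathbf{t}) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(\mathbf{v}_{w_i}; \mu_{t_i}, \Sigma_{t_i})$$

We use the Baum-Welch algorithm to fit the $\mu_{t^*}$ and $\Sigma_{t^*}$ parameters. In every iteration, we update $\mu_{t^*}$ as follows:

$$\mu_{t^*}^{\text{new}} = \frac{\sum_{\mathbf{v} \in \mathcal{T}} \sum_{i=1 \ldots \ell} p(t_i = t^* \mid \mathbf{v}) \times \mathbf{v}_{w_i}}{\sum_{\mathbf{v} \in \mathcal{T}} \sum_{i=1 \ldots \ell} p(t_i = t^* \mid \mathbf{v})} \qquad (3)$$

where $\mathcal{T}$ is a data set of word embedding sequences $\mathbf{v}$ each of length $|\mathbf{v}| = \ell$, and $p(t_i = t^* \mid \mathbf{v})$ is the posterior probability of label $t^*$ at position $i$ in the sequence $\mathbf{v}$. Likewise the update to $\Sigma_{t^*}$ is:

$$\Sigma_{t^*}^{\text{new}} = \frac{\sum_{\mathbf{v} \in \mathcal{T}} \sum_{i=1 \ldots \ell} p(t_i = t^* \mid \mathbf{v}) \times \delta \delta^\top}{\sum_{\mathbf{v} \in \mathcal{T}} \sum_{i=1 \ldots \ell} p(t_i = t^* \mid \mathbf{v})} \qquad (4)$$

where $\delta = \mathbf{v}_{w_i} - \mu_{t^*}^{\text{new}}$.

^6 “Essentially, all models are wrong, but some are useful” - George E. P. Box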
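
The updates in Eqs. 3 and 4 can be sketched as a posterior-weighted mean and a posterior-weighted outer product. The sketch below assumes the posteriors $p(t_i = t^* \mid \mathbf{v})$ have already been computed by the forward-backward pass of the E-step, which is not shown; the function name, the data layout, and the toy values are hypothetical.

```python
def m_step(sequences, posteriors, tag):
    """One M-step for a single tag t*: returns (mean, covariance).

    sequences:  list of sentences, each a list of d-dimensional embeddings
    posteriors: posteriors[s][i][tag] = p(t_i = tag | v), assumed given
    """
    d = len(sequences[0][0])

    # Eq. 3: posterior-weighted average of the embeddings.
    total = 0.0
    weighted_sum = [0.0] * d
    for seq, post in zip(sequences, posteriors):
        for v, p in zip(seq, post):
            total += p[tag]
            for k in range(d):
                weighted_sum[k] += p[tag] * v[k]
    mean = [s / total for s in weighted_sum]

    # Eq. 4: posterior-weighted average of delta * delta^T around the new mean.
    cov = [[0.0] * d for _ in range(d)]
    for seq, post in zip(sequences, posteriors):
        for v, p in zip(seq, post):
            delta = [v[k] - mean[k] for k in range(d)]
            for a in range(d):
                for b in range(d):
                    cov[a][b] += p[tag] * delta[a] * delta[b]
    cov = [[c / total for c in row] for row in cov]
    return mean, cov


# Toy data: one sentence of two tokens, both fully assigned to tag "N".
toy_sequences = [[[0.0, 0.0], [2.0, 2.0]]]
toy_posteriors = [[{"N": 1.0}, {"N": 1.0}]]
print(m_step(toy_sequences, toy_posteriors, "N"))
```

Note that the covariance update uses the freshly updated mean, matching the definition of $\delta$ in Eq. 4.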

3.2 Conditional Random Field Autoencoders

The second class of models this work extends is called CRF autoencoders, which we recently proposed in (Ammar et al., 2014). It is a scalable family of models for feature-rich learning from unlabeled examples. The model conditions on one copy of the structured input, and generates a reconstruction of the input via a set of interdependent latent variables which represent the linguistic structure of interest. As shown in Eq. 5, the model factorizes into two distinct parts: the encoding model $p(\mathbf{t} \mid \mathbf{w})$ and the reconstruction model $p(\hat{\mathbf{w}} \mid \mathbf{t})$; where $\mathbf{w}$ is the structured input (e.g., a token sequence), $\mathbf{t}$ is the linguistic structure of interest (e.g., a sequence of POS tags), and $\hat{\mathbf{w}}$ is a generic reconstruction of the input. For POS induction, the encoding model is a linear-chain CRF with feature vector $\lambda$ and local feature functions $\mathbf{f}$.

$$p(\hat{\mathbf{w}}, \mathbf{t} \mid \mathbf{w}) = p(\mathbf{t} \mid \mathbf{w}) \times p(\hat{\mathbf{w}} \mid \mathbf{t})$$
$$\propto p(\hat{\mathbf{w}} \mid \mathbf{t}) \times \exp \lambda \cdot \sum_{i=1}^{|\mathbf{w}|} \mathbf{f}(t_i, t_{i-1}, \mathbf{w}) \qquad (5)$$

In (Ammar et al., 2014), we explored two kinds of reconstructions $\hat{\mathbf{w}}$: surface forms and Brown clusters (Brown et al., 1992), and used “stupid multinomials” as the underlying distributions for regenerating $\hat{\mathbf{w}}$.

Gaussian Reconstruction. In this paper, we use d-dimensional word embedding reconstructions $\hat{w}_i = \mathbf{v}_{w_i} \in \mathbb{R}^d$, and replace the multinomial distribution of the reconstruction model with the multivariate Gaussian distribution in Eq. 2. We again use the Baum-Welch algorithm to estimate $\mu_{t^*}$ and $\Sigma_{t^*}$ similar to Eq. 3. The only difference is that posterior label probabilities are now conditional on both the input sequence $\mathbf{w}$ and the embeddings sequence $\mathbf{v}$, i.e., replace $p(t_i = t^* \mid \mathbf{v})$ in Eq. 3 with $p(t_i = t^* \mid \mathbf{w}, \mathbf{v})$.

4 Experiments

In this section, we attempt to answer the following questions:

• §4.1: Do syntactically-informed word embeddings improve POS induction? Which model performs best?

• §4.2: What kind of word embeddings are suitable for POS induction?

4.1 Choice of POS Induction Models

Here, we compare the following models for POS induction:

• Baseline: HMM with multinomial emissions (Kupiec, 1992),

• Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),

• Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),^7

• Proposed: HMM with Gaussian emissions, and

• Proposed: CRF autoencoder with Gaussian reconstructions.


Data. To train the POS induction models, we used the plain text from the training sections of the CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian and Italian), and the Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the corresponding universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014).


Setup. In this section, we used skip-gram (i.e., word2vec) embeddings with a context window size = 1 and with dimensionality d = 100, trained with the largest corpora for each language in (Quasthoff et al., 2006), in addition to the plain text used to train the POS induction models.^8 In the proposed models, we only show results for estimating $\mu_t$, assuming a diagonal covariance matrix $\Sigma_t(k, k) = 0.45 \;\; \forall k \in \{1, \ldots, d\}$.^9 While the CRF autoencoder with multinomial reconstructions was carefully initialized as discussed in (Ammar et al., 2014), the CRF autoencoder with Gaussian reconstructions was initialized uniformly at random in [-1, 1]. All HMM models were also randomly initialized. We tuned all hyperparameters on the English PTB corpus, then fixed them for all languages.

^7 We use the configuration with best performance which reconstructs Brown clusters.

[Figure 1 x-axis, both panels: Arabic, Basque, Danish, Greek, Hungarian, Italian, Turkish, Zulu, Average]

Figure 1: POS induction results. (V-measure, higher is better.) Window size is 1 for all word embeddings. Left: Models which use standard skip-gram word embeddings (i.e., Gaussian HMM and Gaussian CRF Autoencoder) outperform all baselines on average across languages. Right: comparison between standard and structured skip-grams on Gaussian HMM and CRF Autoencoder.

Evaluation. We use the V-measure evaluation metric (Rosenberg and Hirschberg, 2007) to evaluate the predicted syntactic classes at the token level.^10
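
For readers unfamiliar with the metric, the sketch below computes V-measure as the harmonic mean of homogeneity and completeness, following the standard definition with equal weighting of the two terms. It is not the evaluation script used for the reported results; the function names and the toy tag sequences are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    """Entropy H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """Conditional entropy H(a | b) for two parallel label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[y]) for (_, y), c in joint.items())


def v_measure(gold, predicted):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_gold, h_pred = _entropy(gold), _entropy(predicted)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, predicted) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


gold_tags = ["NOUN", "NOUN", "VERB", "VERB"]   # toy gold tags
induced = [7, 7, 3, 3]                          # induced cluster ids
print(v_measure(gold_tags, induced))
```

The metric is invariant to how the induced clusters are named, which is why it suits unsupervised induction, where cluster identifiers carry no inherent meaning.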

Results. The results in Fig. 1 (left) clearly suggest that we can use word embeddings to improve POS induction. Surprisingly, the feature-less Gaussian HMM model outperforms the strong feature-rich baselines: Multinomial Featurized HMM and Multinomial CRF Autoencoder. One explanation is that our word embeddings were induced using larger unlabeled corpora than those used to train the POS induction models. The best results are obtained when word embeddings are combined with a feature-rich model, as in the Gaussian CRF autoencoder. This set of results suggests that word embeddings and hand-engineered features play complementary roles in POS induction. It is worth noting that the CRF autoencoder model with Gaussian reconstructions did not require careful initialization.^11

^8 We used the corpus/tokenize-anything.sh script in the cdec decoder (Dyer et al., 2010) to tokenize the corpora from (Quasthoff et al., 2006). The other corpora were already tokenized. In Arabic and Italian, we found a lot of discrepancies between the tokenization used for inducing word embeddings and the tokenization used for evaluation. We expect our results to improve with consistent tokenization.

^9 Surprisingly, we found that estimating $\Sigma_t$ significantly degrades the performance. This may be due to overfitting (Shinozaki and Kawahara, 2007). Possible remedies include using a prior (Gauvain and Lee, 1994).

^10 We found the V-measure results to be consistent with the many-to-one evaluation metric (Johnson, 2007). We only show one set of results for brevity.


4.2 Choice of Embeddings 


Standard skip-gram vs. structured skip-gram. On Gaussian HMMs, structured skip-gram embeddings score moderately higher than standard skip-grams, and the gap widens as the context window size gets larger. The reason may be that structured skip-gram embeddings give each position within the context window its own projection matrix, so the smearing effect is not as pronounced as the window grows when compared to the standard embeddings. However, the best performance is still obtained when the window is small.^12

^11 In (Ammar et al., 2014), we found that careful initialization for the CRF autoencoder model with multinomial reconstructions is necessary.

^12 In preliminary experiments, we also compared standard skip-gram embeddings to SENNA embeddings (Collobert et al., 2011) (which are trained in a semi-supervised multi-task learning setup, with one task being POS tagging) on a subset of the English PTB corpus. As expected, the induced POS tags are much better when using SENNA embeddings, yielding a V-measure score of 0.57 compared to 0.51 for skip-gram embeddings. Since SENNA embeddings are only available in English, we did not include them in the comparison in Fig. 1.

Figure 2: Effect of window size and embeddings type on POS induction over the languages in Fig. 1. d = 100. The model is HMM with Gaussian emissions.

Figure 3: Effect of dimension size on POS induction on a subset of the English PTB corpus. w = 1. The model is HMM with Gaussian emissions.

Dimensions = 20 vs. 200. We also varied the number of dimensions in the word vectors ($d \in \{20, 50, 100, 200\}$). The best V-measure we obtain is 0.504 (d = 20) and the worst is 0.460 (d = 100). However, we did not observe a consistent pattern, as shown in Fig. 3.

Window size = 1 vs. 16. Finally, we varied the window size for the context surrounding target words ($w \in \{1, 2, 4, 8, 16\}$). w = 1 yields the best average V-measure across the eight languages, as shown in Fig. 2. This is true for both standard and structured skip-gram models. Notably, larger window sizes appear to produce word embeddings with less syntactic information. This result confirms the observations of Bansal et al. (2014).

4.3 Discussion

We have shown that (re)generating word embeddings does much better than generating opaque word types in unsupervised POS induction. At a high level, this confirms prior findings that unsupervised word embeddings capture syntactic properties of words, and shows that some embeddings capture more syntactically salient information than others. As such, we contend that unsupervised POS induction can be seen as a diagnostic metric for assessing the syntactic quality of embeddings.

To get a better understanding of what the multivariate Gaussian models have learned, we conduct a hill-climbing experiment on our English dataset. We seed each POS category with the average vector of 10 randomly sampled words from that category and train the model. Seeding unsurprisingly improves tagging performance. We also find that words that are the nearest to the centroids generally agree with the correct category label, which validates our assumption that syntactically similar words tend to cluster in the high-dimensional embedding space. It also shows that careful initialization of model parameters can bring further improvements.
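
The seeding and nearest-word inspection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiment: the function names and toy embeddings are hypothetical, and Euclidean distance is assumed as the notion of "nearest", which the text does not specify.

```python
import math
import random


def seed_centroid(embeddings, category_words, k=10, rng=None):
    """Average the embeddings of up to k randomly sampled words from one category."""
    rng = rng or random.Random(0)
    sample = rng.sample(sorted(category_words), min(k, len(category_words)))
    d = len(embeddings[sample[0]])
    return [sum(embeddings[w][j] for w in sample) / len(sample) for j in range(d)]


def nearest_words(embeddings, centroid, n=5):
    """Words whose embeddings are closest to the centroid (Euclidean distance assumed)."""
    return sorted(embeddings, key=lambda w: math.dist(embeddings[w], centroid))[:n]


# Toy 2-dimensional embeddings, hypothetical values.
toy_embeddings = {
    "run": [1.0, 0.0],
    "eat": [0.9, 0.1],
    "dog": [0.0, 1.0],
    "cat": [0.1, 0.9],
}
verb_seed = seed_centroid(toy_embeddings, {"run", "eat"})
print(verb_seed)
print(nearest_words(toy_embeddings, verb_seed, n=2))
```

In the toy example the two words closest to the verb seed are the two verbs, mirroring the observation that words nearest a centroid generally agree with the category label.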

However, we also find that words that are close to the centroid are not necessarily representative of what linguists consider to be prototypical. For example, Hopper and Thompson (1983) show that physical, telic, past tense verbs are more prototypical with respect to case marking, agreement, and other syntactic behavior. However, the verbs nearest our centroid all seem rather abstract. In English, the nearest 5 words in the verb category are entails, aspires, attaches, foresees, deems. This may be because these words seldom serve functions other than verbs, and placing the centroid around them incurs less penalty (in contrast to physical verbs, e.g. bite, which often also act as nouns). Therefore one should be cautious in interpreting what is prototypical about them.


5 Conclusion 

We propose using a multivariate Gaussian model to generate vector space representations of observed words in generative or hybrid models for POS induction, as a superior alternative to using multinomial distributions to generate categorical word types. We find the performance from a simple Gaussian HMM competitive with strong feature-rich baselines. We further show that substituting the emission part of the CRF autoencoder can bring further improvements. We also confirm previous findings which suggest that smaller context windows in skip-gram models result in word embeddings which encode more syntactic information. It would be interesting to see if we can apply this approach to other tasks which require generative modeling of textual observations such as language modeling and grammar induction.